ABSTRACT


This  report  discusses  a  digital  signal  sorting  technique  for  identifying 
digitized  speech  signals  in  an  environment  of  common  digital  data  signals. 

The technique is modular in that two levels of recognition are possible: 1)
a linear prediction based scheme which provides a fine-grain recognition;
and 2) a digital word synchronization scheme which provides a coarse, or
first-cut, recognition. In the fine-grain scheme, a discrimination measure
defined herein is shown to provide good sorting of the digitized speech
sequence. This is demonstrated through examples.
CHAPTER I - Introduction

In communication networks where several digital sequences representing
different analog signal sources are present, it is important, at times, to
know which digital sequence comes from which analog source. To be more
specific, in communication networks where pulse code modulation (PCM) and/or
differential PCM (DPCM) sequences for speech, Raised Cosine, Partial Response,
and Modem Phase Modulation signals (Table 1) are present, one may be interested
in determining when the digital sequence representing the analog speech signal
is present so that the proper channel conditioning can be selected. Due
to the high redundancy in the analog speech signal and the corresponding
redundancy of the bit activity in the digital words, the conditioning
required for reliably transmitting speech signals is not as critical as it
is for, say, DPCM Modem Phase Modulation. Since the amount of channel
conditioning required determines the cost per channel use, this cost
factor is another reason why it is important to be able to sort the digital
sequences and identify the analog source. Finally, there are occasions when
the identification of the underlying analog source is itself the goal.

This  report  discusses  a  digital  signal  sorting  technique  for  identifying 
digitized  speech  signals  in  an  environment  of  common  digital  data  signals. 

The  technique  is  modular  in  that  two  levels  of  recognition  are  possible:  1) 
a  linear  prediction  based  scheme  which  provides  a  fine-grain  recognition 
and  2)  a  digital  word  synchronization  scheme  which  provides  a  coarse 
first-cut  recognition.  In  the  fine-grain  scheme,  a  discrimination  measure 
defined  herein  is  shown  to  provide  good  sorting  of  the  digitized  speech 
sequence.  This  is  demonstrated  through  examples. 

The report is organized as follows. Section I of this chapter provides
a discussion of the linear prediction technique, and Section II provides a
statement of the subject problem and some reasonable assumptions pertaining
to the linear prediction scheme. Chapter II covers the details of the proposed
solutions, which include the linear prediction scheme and the synchronization
sorting scheme. Some results are also included in that chapter. The
conclusions are presented in Chapter III, and supporting material for the
synchronization sorting scheme is given in the Appendix.
SECTION I - Background and Concepts of Linear Prediction (LP)

This section is based in large part on conclusions reached by
J.B. O'Neal and Raymond W. Stroh [1]. They conclude that

"DPCM systems designed for speech do poorly with data signals.
For example, a 3-tap DPCM system designed for speech has an SNI
of about 11.4 dB when speech is applied to the system but it
has an SNI of -3.4 dB when a data signal with raised cosine
power spectrum is applied."

Furthermore, it is stated that

"... DPCM system designed for speech would have a signal-to-noise
ratio that is 14.8 dB lower with the data signal than with the
speech signal. Such a system would perform 3.4 dB worse than
PCM for the data signal."

Since these findings are based on a DPCM system where both the encoder
and the decoder have predictor coefficients that are optimum for speech
and are known, a question that we seek to answer is whether a similar
advantage can be realized if the decoder has the optimum predictor
coefficients for speech and the coefficients of the encoder are unknown.
In block diagram form, we have the following (Figure 1).


[Figure 1. General System Model: a level-to-binary coder (encoder side, predictor coefficients UNKNOWN) followed by a binary-to-level decoder (decoder side, predictor coefficients KNOWN).]


Let the set $\{b_k,\ k = 1,2,3\}$ be the unknown predictor coefficients
and the set $\{a_k,\ k = 1,2,3\}$ be the known predictor coefficients,
at the encoder and decoder respectively. Then in the encoding process,
with $s_i$ the ith sample of the input analog signal $s(t)$, the estimate
for $s_i$, $\hat{s}_i$, is given by

$$\hat{s}_i = b_1 s_{i-1} + b_2 s_{i-2} + b_3 s_{i-3} = \sum_{k=1}^{3} b_k s_{i-k}.$$

The resulting error sequence into the quantizer, Q, is

$$e_i = s_i - \hat{s}_i.$$
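The prediction and error equations above can be illustrated with a minimal Python sketch. The function name `prediction_errors`, the sample values, and the coefficient values are hypothetical and chosen only for illustration; samples before the start of the sequence are assumed to be zero.

```python
def prediction_errors(samples, coeffs):
    """Return e_i = s_i - sum_k coeffs[k-1] * s_{i-k} for every sample.

    Samples before the start of the sequence are treated as zero
    (an assumption made for this sketch).
    """
    errors = []
    for i, s in enumerate(samples):
        estimate = 0.0
        for k, b in enumerate(coeffs, start=1):
            if i - k >= 0:
                estimate += b * samples[i - k]
        errors.append(s - estimate)
    return errors


# Hypothetical input samples and 3-tap coefficients b_1, b_2, b_3.
# With (1, 0, 0) the predictor simply repeats the previous sample.
samples = [0.0, 0.5, 0.9, 1.0, 0.8]
coeffs = [1.0, 0.0, 0.0]
errors = prediction_errors(samples, coeffs)
```

For a slowly varying input the error samples are much smaller than the input samples, which is the redundancy that DPCM exploits.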


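It has been well established [2] that the optimum set of predictor
coefficients can be found as follows.

The mean squared prediction error is

$$\sigma^2 = E\{(s_i - \hat{s}_i)^2\} = E\Big\{\Big(s_i - \sum_{k=1}^{3} b_k s_{i-k}\Big)^2\Big\},$$

which in matrix notation can be written as

$$\sigma^2 = 1 - 2B^T G + B^T R B,$$

where

$$B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}, \qquad
G = \begin{bmatrix} \psi(1) \\ \psi(2) \\ \psi(3) \end{bmatrix}, \qquad
R = \begin{bmatrix} 1 & \psi(1) & \psi(2) \\ \psi(1) & 1 & \psi(1) \\ \psi(2) & \psi(1) & 1 \end{bmatrix}.$$

The elements of G and R are the autocorrelation coefficients of the
sequence $s_i$,

$$\psi(|i-j|) = E\{s_i s_j\}.$$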

The values of $\psi(\cdot)$ are determined by the autocorrelation
function $\phi(\tau)$ of the input signal $s(t)$ and the rate at which $s(t)$ is sampled
to form the sequence $\{s_i\}$, i.e., if the sampling rate is $1/T$ Hz then

$$\psi(k) = \phi(kT).$$


The optimum coefficients are given by finding the vector $B$ that minimizes
$\sigma^2$. This is given by

$$B_{opt} = R^{-1} G.$$

This further states that the minimum error is given by

$$\sigma^2_{min} = 1 - B_{opt}^T G = 1 - G^T R^{-1} G.$$

NOTE: The set $\{a_k\}$ for the decoder is calculated in the same
manner as the set $\{b_k\}$.
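As a numerical illustration of $B_{opt} = R^{-1}G$, the following Python sketch solves the 3-tap normal equations with a small Gauss-Jordan routine. The helper names and the test autocorrelation $\psi(k) = \rho^k$ (a first-order model with $\rho = 0.9$) are assumptions made for this sketch, not values from the report.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def optimum_predictor(psi):
    """psi = [psi(1), psi(2), psi(3)] -> (B_opt, minimum mean squared error)."""
    full = [1.0] + list(psi)          # psi(0) = 1 for a unit-variance sequence
    order = len(psi)
    R = [[full[abs(i - j)] for j in range(order)] for i in range(order)]
    G = list(psi)
    b_opt = solve_linear(R, G)
    mse_min = 1.0 - sum(b * g for b, g in zip(b_opt, G))
    return b_opt, mse_min


# Hypothetical first-order autocorrelation model, psi(k) = rho**k.
rho = 0.9
b_opt, mse_min = optimum_predictor([rho, rho ** 2, rho ** 3])
```

For this particular model the solution reduces to a single nonzero tap, $b_1 = \rho$, with minimum error $1 - \rho^2$, which gives a convenient check on the solver.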


Now, since the transmitted sequence (Fig. 1) to the level-to-binary
encoder is $e_i + q_i$, where $q_i$ is the quantization noise component, and

$$e_i = s_i - \sum_{k=1}^{3} b_k s_{i-k},$$

the output of the binary-to-level decoder is $e_i + q_i$. The problem, as
stated above, arises in the predictor loop of the decoder, since the set
$\{a_k\}$ of coefficients is used instead of the set $\{b_k\}$. This is
shown in Figure 2.


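[Figure 2. Decoder Predictor Loop: the sequence $e_i + q_i$ from the binary-to-level decoder, with $e_i = s_i - \hat{s}_i$ and $\hat{s}_i = \sum_{k=1}^{3} b_k s_{i-k}$ formed at the encoder, is added to the decoder prediction $\sum_{k=1}^{3} a_k s'_{i-k}$ to produce the decoder output $s'_i$.]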


If $a_i \approx b_i$ for each $i$, then the decoder output $s'_i \approx s_i$, and the
output continuous time signal $s'(t)$ will approximate $s(t)$ very closely. On the
other hand, if $a_i$ is not approximately $b_i$ for all $i$, the output will not be
a good approximation of $s(t)$.

Since our overall goal in this study is to identify $s'(t) \approx s(t)$
as a speech signal, our problem would be solved if we knew $\{b_k\}$; then
we would need only calculate the error $e'_i = s'_i - \hat{s}'_i$ and compare this
to some threshold. We do not, however, have such knowledge. We, therefore,
must seek an alternative approach (Chapter II, Section I).
SECTION II - Problem Statement and Assumptions


A.  Problem  Statement 

Let eq(i) be the output of a channel, and let the signal-to-
additive-noise ratio be sufficient for good detection and very
low probability of error (errors of both the first and second kinds).
With the following assumptions, determine whether eq(i) is a
quantized version of an analog speech signal or a digitized common
data signal.

Assumption 1

$$eq(i) = s_i - \hat{s}_i + q_i$$

where $s_i$, $\hat{s}_i$, and $q_i$ are as defined in Section I. $q_i$ may be zero.

Assumption 2

eq(i)  represents  the  output  of  an  encoder  in  either  a  PCM, 

DPCM,  or  ADPCM  system. 

Assumption  3 

The  data  rate  and  approximate  word  length  are  known. 

Assumption  4 

The  quantizers  used  in  the  encoders  are  optimum  for  some 
signal  type  (common  data  or  speech  signals). 

B.  Comments  on  Assumptions 

The above assumptions, in most environments, represent an absolute
minimum amount of available information. All of these assumptions
are more or less self-evident, but we shall comment further on (1)
and (4).

The signal eq(i) merely represents the generalized signal discussed
in Section I, and when $q_i = 0$, eq(i) simply represents a system without
a quantizer, or with no quantization error.

In (4), it is considered good design practice to design the quantizer
with full concern given to the spectral shape, i.e., the probability
density function of the amplitude of the applied signal. Now on to the
proposed solution.
CHAPTER II - Proposed Solutions


SECTION  I  -  Linear  Prediction  (LP)  Sorting  Scheme 

The  Concept 

Subsystems $A_1$ and $A_2$ of Figure 3 are 1-tap and 3-tap DPCM systems,
respectively, with coefficients that are optimum for speech, and subsystems
$B_1$ and $B_2$ are similar DPCM systems with coefficients that are optimum
for one of the common data signals (see Tables I and II).

The discrete signal eq(i) is fed into each of the five subsystems.
Since the PCM subsystem is nonadaptive, it is shown simply as a direct
feed-through of eq(i). That is, the PCM decoder output is

$$eq(i) = s_i - \hat{s}_i + q_i + n.$$

Call this $r_1$, i.e.,

$$r_1 = s_i - \hat{s}_i + q_i + n, \qquad (1)$$

where $n$ is additive channel noise.

The outputs of the other subsystems are shown in Figure 3, and are given
as

$$r_2 = eq(i) + \hat{s}_{i1}, \quad r_3 = eq(i) + \hat{s}_{i2}, \quad r_4 = eq(i) + \tilde{s}_{i1}, \quad \text{and} \quad r_5 = eq(i) + \tilde{s}_{i2}, \qquad (2)$$

where $\hat{s}_{i1}$ is the prediction of the 1-tap predictor with optimum coefficient
$a_1$ for speech, $\hat{s}_{i2}$ is the prediction of the 3-tap predictor with optimum
coefficients $a_1$, $a_2$, and $a_3$ for speech, and $\tilde{s}_{i1}$ and $\tilde{s}_{i2}$ are
similarly defined for a common data signal.

The Threshold Decision Devices (TDD) provide the inputs into the
decision function $f(p, q\text{-}n, \ell)$ (Figure 3), which simply forms the
ratios $r_2/r_1$, $r_3/r_1$, $r_4/r_1$, and $r_5/r_1$ and makes a decision (yes or no)
on whether the input signal eq(i) represents a speech signal. We can get a feel
for the components of these ratios by considering two cases.

[FIGURE 3: BLOCK DIAGRAM OF PROPOSED SOLUTION (LP). Figure notes: $p = 1,2,\ldots,7$ and $\ell = 1,2,\ldots,7$ depending on the type of input desired and the type of input expected, respectively; $q$ and $n$ denote the type of coefficients installed in subsystems A and B, respectively. Signal types: 1 - Speech; 2 - Raised Cosine, $\alpha = 0$; 3 - Raised Cosine, $\alpha = .5$; 4 - Raised Cosine, $\alpha = 1$; 5 - Partial Response; 6 - Modem 1; 7 - Modem 2.]

Case I: $\hat{s}_i = \hat{s}_{i2}$, Speech.

$$r_3 = s_i - \hat{s}_i + q_i + n + \hat{s}_{i2} = s_i + q_i + n, \qquad (3a)$$

so that

$$\frac{r_3}{r_1} = \frac{s_i + q_i + n}{s_i - \hat{s}_i + q_i + n}.$$

Since $\hat{s}_i \approx s_i$ for optimum coefficients in the encoder, the ratio becomes

$$\frac{r_3}{r_1} = \frac{s_i + q_i + n}{q_i + n} = \frac{\text{signal plus noise}}{\text{noise}} \text{ ratio}. \qquad (3b)$$

For the 1-tap predictor, $\hat{s}_{i1}$ does not approximate $s_i$ as well as $\hat{s}_{i2}$, due to
the coarseness of the estimate. $r_1$ is still given as $r_1 = q_i + n$, but

$$r_2 = s_i - \hat{s}_i + q_i + n + \hat{s}_{i1}. \qquad (3c)$$

Since $\hat{s}_{i1}$ does not necessarily cancel $\hat{s}_i$, and since we have assumed $\hat{s}_i$
comes from a 3-tap predictor, the ratio is

$$\frac{r_2}{r_1} = \frac{s_i - \hat{s}_i + q_i + n + \hat{s}_{i1}}{q_i + n} = \frac{s_i + q_i + n_1}{q_i + n}, \qquad (3d)$$

where $n_1$ is the total additive noise $n + \hat{s}_{i1} - \hat{s}_i$.

If $\hat{s}_i > \hat{s}_{i1}$, which is generally the case, $r_2/r_1 < r_3/r_1$, which implies
that $r_3 > r_2$ as required by the conclusions of O'Neal and Stroh, i.e., we
get a better signal-to-noise ratio out of the 3-tap predictor system than
out of the 1-tap predictor system.
For $r_4$ and $r_5$, the outputs of the 1-tap predictor and the 3-tap predictor
respectively, whose coefficients are optimum for some common type of data
signal, the ratios are

$$\frac{r_4}{r_1} = \frac{s_i - \hat{s}_i + q_i + n + \tilde{s}_{i1}}{q_i + n} = \frac{s_i + q_i + n_2}{q_i + n}$$

and

$$\frac{r_5}{r_1} = \frac{s_i - \hat{s}_i + q_i + n + \tilde{s}_{i2}}{q_i + n} = \frac{s_i + q_i + n_3}{q_i + n}, \qquad (3e)$$

where $n_2 = n + \tilde{s}_{i1} - \hat{s}_i$ and $n_3 = n + \tilde{s}_{i2} - \hat{s}_i$
are the total additive noise out of the two systems.

Again, we can state that $\hat{s}_i > \tilde{s}_{i1}$ and $\hat{s}_i > \tilde{s}_{i2}$, which implies that

$$r_4 < r_2, \quad r_5 < r_2, \quad r_4 < r_3, \quad \text{and} \quad r_5 < r_3. \qquad (4)$$

Therefore, for this case we see that the decision devices could arrive
at the correct decision, i.e., eq(i) is speech, simply by confirming the
inequalities in equation (4).


Case II: $\hat{s}_i = \tilde{s}_{i2}$, nonspeech.

For this case the roles of the ratios are switched, i.e.,

$$r_2 < r_4, \quad r_2 < r_5, \quad r_3 < r_4, \quad \text{and} \quad r_3 < r_5. \qquad (5)$$

These are true for the same reasons as stated in Case I. Also, the correct
decision could be arrived at using the same criteria as in Case I, applied
to eq. (5).

We analyze the design in Figure 3 for common signals in a telecommunication
network (Tables I and II) in the following subsection.
[TABLE IV: Ratio Data for the 1-tap Predictor (DPCM) and the PCM]


Discussion  of  (LP)  Solution  and  Results 

As stated above, the data for analyzing the model are given in
Tables III and IV. These are summaries of the ratio information for the
signal combinations in Tables I and II.

We  consider  an  example  in  order  to  explain  the  concept  in  more  detail. 

Example: 

Reference Figure 3 and column 2 of Table II.

Select the coefficients in subsystem A to be as follows:

$A_1$: $a_1 = 1.936$

$A_2$: $a_1 = 1.936$, $a_2 = -1.553$, $a_3 = .4972$

These are the optimum speech coefficients.

Subsystem B has coefficients

$B_1$: $b_1 = .518$

$B_2$: $b_1 = .518$, $b_2 = -.7834$, $b_3 = .2165$

These are the optimum Partial Response (PR) coefficients.

Now if the signal applied to the system shown in Figure 3 is
speech, then the Threshold Decision Device (TDD) ratios are

$$\frac{r_2}{r_1} = 1.99986, \quad \frac{r_3}{r_1} = 3.7196, \quad \frac{r_4}{r_1} = .85212, \quad \text{and} \quad \frac{r_5}{r_1} = .69343.$$

Reference: Table IV, entry speech/speech, for $r_2/r_1$
Table III, entry speech/speech, for $r_3/r_1$
Table IV, entry speech/Partial Response, for $r_4/r_1$
Table III, entry speech/Partial Response, for $r_5/r_1$

If, however, the input is representative of a PR data signal,
we get:

$$\frac{r_2}{r_1} = 1.217587, \quad \frac{r_3}{r_1} = 1.03753, \quad \frac{r_4}{r_1} = 1.023293, \quad \text{and} \quad \frac{r_5}{r_1} = 1.47741.$$

The decision function representing all the TDD outputs is defined as

$$f(p, q\text{-}n, \ell) = \prod_{j=1}^{2} \exp[TDD_j^{q}] \prod_{j=3}^{4} \exp[TDD_j^{n}],$$

where $p = 1,2,3,4,5,6,7$ and $\ell = 1,2,3,4,5,6,7$ depending on the type
of input desired and type of input expected, respectively.

For our example, where the input is assumed to be speech and the
coefficients in subsystems A and B are for speech and Partial Response
respectively, the decision function is ($\ell = 1$, speech; $p = 1$)

$$f(1, 1\text{-}5, 1) = \exp[TDD_1^{1} + TDD_2^{1} + TDD_3^{5} + TDD_4^{5}]$$

$$= \exp[1.99986 + 3.7196 + .85212 + .69343]$$

$$= \exp(7.26501)$$

$$= 1429.3999$$

The decision function for $\ell = 5$, Partial Response, is ($p = 1$)

$$f(1, 1\text{-}5, 5) = \exp[TDD_1^{1} + TDD_2^{1} + TDD_3^{5} + TDD_4^{5}]$$

$$= \exp[1.217587 + 1.03753 + 1.023293 + 1.47741]$$

$$= \exp(4.7558)$$

$$= 116.2589$$
Now, if the input signal is neither speech nor Partial Response,
but, say, Raised Cosine $\alpha = .5$ ($\ell = 3$), the decision function gives

$$f(1, 1\text{-}5, 3) = \exp[TDD_1^{1} + TDD_2^{1} + TDD_3^{5} + TDD_4^{5}]$$

$$= \exp[.81941 + .67298 + 1.009253 + 1.90985]$$

$$= \exp(4.41149)$$

$$= 82.3924$$

From a comparison of these calculations, it is quite evident that when
the input signal is speech,

$$f(1, 1\text{-}5, 1) > f(1, 1\text{-}5, 5)$$

and

$$f(1, 1\text{-}5, 1) > f(1, 1\text{-}5, 3).$$

In other words, the discrimination capability of the decision function
is good when the optimum coefficients for speech are loaded in one of the
subsystems.
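The three decision-function values quoted in this example can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `decision_function` is hypothetical, and the TDD ratios are the ones listed in the example (speech coefficients in subsystem A, Partial Response coefficients in subsystem B).

```python
import math


def decision_function(tdd_ratios):
    """exp of the summed TDD ratios, i.e. the product of exp(TDD_j) over all j."""
    return math.exp(sum(tdd_ratios))


# TDD ratios r2/r1, r3/r1, r4/r1, r5/r1 taken from the example in the text.
speech = decision_function([1.99986, 3.7196, 0.85212, 0.69343])
partial_response = decision_function([1.217587, 1.03753, 1.023293, 1.47741])
raised_cosine = decision_function([0.81941, 0.67298, 1.009253, 1.90985])

print(round(speech, 1), round(partial_response, 1), round(raised_cosine, 1))
```

Because the exponential is monotonic, comparing the decision functions is equivalent to comparing the sums of the TDD ratios; the exponential only stretches the separation between the speech case and the others.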

A question that remains to be answered is whether it is possible to
get a decision function approximately equal to $f(1, 1\text{-}5, 1)$ when the
optimum speech coefficients are not loaded in either subsystem. The
answer to this question is no, because in both Table III and
Table IV the largest entry for each signal type is on the main
diagonal. For example, the Modem 1 column entry (across the top of Table III)
1.6672 corresponding to the Modem 1 row (entries on the left of Table III) is
greater than all other column elements; e.g., Partial Response under
Modem 1 is 1.64627. In other words, each signal type considered has a
higher correlation with itself than with any of the other signal
types in the set.

Another question is whether it is possible to realize "good"
discrimination between the possible signal types if the decision
function is calculated based only on the output of subsystem A or
subsystem B. For our specific case, i.e., discriminating speech from
other signal types, we see below that the answer is affirmative.


Subsystem:

(Speech input - coefficients speech)
f(1,1,1) = exp (1.99986 + 3.7196) = exp (5.71946) = 304.74

(Speech input - coefficients Partial Response)
f(3,5,1) = exp (.85211 + .69343) = exp (1.54554) = 4.6905

(Partial Response input - coefficients speech)
f(1,1,5) = exp (1.217587 + 1.03753) = exp (2.255117) = 9.536

(Partial Response input - coefficients Partial Response)
f(3,5,5) = exp (1.023293 + 1.47741) = exp (2.500703) = 12.19106

(Raised Cosine ($\alpha$ = .5) input - coefficients speech)
f(1,1,3) = exp (.81941 + .67298) = exp (1.49239) = 4.44771286

(Raised Cosine ($\alpha$ = .5) input - coefficients Partial Response)
f(3,5,3) = exp (1.009253 + 1.82389) = exp (2.81314) = 16.6622

From these, we see that f(1,1,1) is greater than all the other
combinations considered, i.e., with f(1,1,1) = 304.74 the closest
to it is f(3,5,3) = 16.66. Thus discrimination of speech is also
"good" when only a single subsystem decision function is used. On
the other hand, in comparing f(3,5,3) and f(3,5,5), we see that if the
input is Raised Cosine ($\alpha$ = .5) and the coefficients are Partial Response,
f(3,5,3) = 16.66, whereas for a Partial Response input with Partial
Response coefficients f(3,5,5) = 12.191. Thus, we cannot discriminate
a Partial Response signal from the Raised Cosine ($\alpha$ = .5) signal by
simply comparing decision functions.

An underlying assumption in the above discussion is that
synchronization of the receiver to the transmitter sequence
has been accomplished. We suggest an approach to synchronization in
Section II. This synchronization scheme also provides a digitized speech
signal sorting scheme.




SECTION  II  -  Synchronization  Sorting  Scheme 
THE  CONCEPT 

This  second  sorting  technique  is  essentially  a  word  synchronization 
scheme.  It  is  a  modification  of  a  word  synchronization  method  proposed 
by  T.  Fukinuki  [3]  for  PCM  T.V.  transmission.  This  method  takes  advantage 
of  the  high  correlation  that  exists  between  adjacent  samples  of  speech 
signals,  and  the  resulting  frequency  of  changes  in  the  respective  bits  of 
adjacent  words.  Since  most  quantizers  used  in  PCM  and  DPCM  systems  are 
locally  linear  (for  speech),  the  probability  of  a  change  in  the 
least  significant  bit  (LSB)  of  a  word  is  higher  (generally  much  higher) 
than  a  change  in  the  most  significant  bit  (MSB)  between  adjacent  words. 
These facts provide the basis for the digital word synchronization scheme
discussed below.


Let the word length be seven bits. Denote a typical word by
$(x_1 x_2 x_3 x_4 x_5 x_6 x_7)$, where $x_i = 0$ or $1$ for the binary system. We are only
interested in the binary system here. Denote a string of three successive
seven-bit words as

$$\underbrace{(x_1^1 x_2^1 x_3^1 x_4^1 x_5^1 x_6^1 x_7^1)}_{\text{word 1}} \quad
\underbrace{(x_1^2 x_2^2 x_3^2 x_4^2 x_5^2 x_6^2 x_7^2)}_{\text{word 2}} \quad
\underbrace{(x_1^3 x_2^3 x_3^3 x_4^3 x_5^3 x_6^3 x_7^3)}_{\text{word 3}} \qquad (6)$$

where the superscript denotes the word and the subscript denotes the bit
location. Bits $x_1^1$, $x_1^2$, and $x_1^3$ are the MSBs of the three words, and
$x_7^1$, $x_7^2$, and $x_7^3$ are the LSBs of these words. [Word 1 and word 2 are
adjacent, and so are word 2 and word 3. Word 1 and word 3 are called
successive. Therefore, any word that is not adjacent to the word of interest
will simply be referred to as a successive word with the proper superscript.]

Now, correlation between words, adjacent or successive, is reflected
in changes in the LSB and the next LSB (NLSB) more than in changes in the MSB
and the next MSB (NMSB). This is due to the continuous time nature of the
speech generation process. Moreover, changes in the MSB and NMSB but not in
the LSB and NLSB characterize a discontinuous time process, which is not
generally present in the speech signal. This correlation that exists between
adjacent words can also be discussed in statistical terms, and a digital
word synchronization scheme can be realized by considering the probability of
changes in the value of specific bits in adjacent words (see Appendix). A
system designed around the relative probabilities of bits one and six of a
seven-bit word is discussed below (Figure 4).
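The claim that low-order bits change far more often than high-order bits between adjacent words can be illustrated with a short Python sketch. The test signal (a slowly varying sine wave, one cycle per 400 samples, through a 7-bit uniform quantizer) and the function name `bit_change_rates` are assumptions made for illustration; the report does not specify a test signal.

```python
import math


def bit_change_rates(words, word_length):
    """Fraction of adjacent word pairs in which each bit position changes.

    Index 0 is the MSB and index word_length - 1 is the LSB.
    """
    counts = [0] * word_length
    for prev, curr in zip(words, words[1:]):
        diff = prev ^ curr  # EXCLUSIVE-OR marks the bits that changed
        for pos in range(word_length):
            counts[pos] += (diff >> (word_length - 1 - pos)) & 1
    return [c / (len(words) - 1) for c in counts]


# Hypothetical slowly varying input quantized uniformly to 7-bit words.
WORD_LENGTH = 7
FULL_SCALE = 2 ** WORD_LENGTH - 1
words = [
    round((math.sin(2 * math.pi * i / 400) + 1) / 2 * FULL_SCALE)
    for i in range(4000)
]
rates = bit_change_rates(words, WORD_LENGTH)
print([round(r, 3) for r in rates])
```

For this input the MSB changes only when the signal crosses mid-scale, while the LSB changes on most samples, which is the asymmetry the synchronization scheme relies on.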
Block Diagram, Theory of Operation and Alignment Algorithm

A realization of the concept discussed above is shown in Figure 5, and
a discussion of the Theory of Operation follows.

The data sequence of interest enters the system through gate G1, which
is enabled by the Start Processing Gate Control (SPGC). SPGC is simply a control
signal that opens G1 whenever a synchronization sorting decision (SSD) is
needed. The output of G1 feeds G2 and G3. G2 is a gate that establishes
when the Shift Register (SR) is to be loaded serially. The SR (series) Load
Gate enables G2. G3 is the word align gate, which has the function of
preventing the processing sequence from entering the EXCLUSIVE-OR gate G4
before SR is loaded. The Shift Register (SR) is loaded serially via the
output of G2. SR is, for our discussion, seven bits long; however, its
length can be adapted to accommodate several word sizes. Once SR is full,
the seven bits, the assumed word, are transferred into Parallel Storage (PS)
so that the loaded word may be compared with the adjacent word as well as
with other successive words. Gate G4 is an EXCLUSIVE-OR in which the
stored assumed word is compared bit-by-bit with other succeeding words.

A  typical  comparison  may  progress  as  follows: 

Assume that twenty-one successive bits of the input sequence are the
three seven-bit words given in eq. (6) above, i.e.,

$$x_1^1 x_2^1 \,\big|_A\, x_3^1 x_4^1 x_5^1 x_6^1 x_7^1 x_1^2 x_2^2 \,\big|_B\, x_3^2 x_4^2 x_5^2 x_6^2 x_7^2 x_1^3 x_2^3 \,\big|_C\, x_3^3 x_4^3 x_5^3 x_6^3 x_7^3 x_1^4 \cdots \qquad (7)$$

where, of course, we do not know the subscripts and superscripts. Also
assume the seven bits loaded in SR are those from A to B, where $x_3^1$ is the
MSB and $x_2^2$ is the LSB. The EXCLUSIVE-OR operation, therefore, will first
occur in G4 between bits $x_2^2$ and $x_2^3$, since the assumed word from A to B
will be compared against the next word out of G3, which consists of the bits
from B to C. The output of G4 is a sequence of bits representing the bit
changes that have occurred between the two assumed words. As we stated
previously, if the assumed words are true digital speech words, then the
probability of a change in the MSB from one word to the next will be much
lower than the probability of a change taking place in the LSB position.
The details of the decision process, which uses the output of G4, are given
next.
Gates G5, G6, G7, and G8 are conventional two-input AND gates, and
their inputs are from G4 and a ring counter. The ring counter is strapped
so that BIT #6 is fed into G5, BIT #5 goes to G7, BIT #2 goes to G8, and
BIT #1 goes to G6. This arrangement, therefore, assigns G5 as the MSB gate,
G6 as the LSB gate, G7 as the NMSB gate, and G8 as the NLSB gate. A bit change
indication out of G4 enables G6, allowing BIT #1 out of the ring counter to drive
the Up/Down Counter #1 (U/DC#1) one step down. (If there is no bit change,
the counter does not step.) The next bit out of G4 is compared with BIT #2
in G8, and if both of them are HIGH (logical "1"), Up/Down Counter #2
(U/DC#2) steps down one count. This pattern continues for the MSB and the
NMSB in gates G5 and G7, respectively. If, for each of the four (4) bits
checked, a change has occurred from assumed word 1 to word 2, the net
effect is that both U/DC#1 and U/DC#2 are back in their initial states,
i.e., the MSB gives an up-count of one on U/DC#1 while the LSB gives a
down-count of one.

As stated above, when the word loaded in SR is a true digital speech
word, one expects a change in the LSB more frequently than in the MSB (something
like 8 times more frequently for a six-bit word). Likewise, there are more
frequent changes in the NLSB than in the NMSB. These, therefore, force U/DC#1
and U/DC#2 to count down more frequently than up, which gives an in-sync
pulse condition as long as this pattern exists. If the MSB and/or the NMSB
change more frequently than the LSB and/or NLSB, U/DC#1 and/or U/DC#2 count(s)
up, leading to an out-of-sync condition. U/DC#2 is a shorter counter than
U/DC#1, and its output acts as a flag of a pending out-of-sync condition.
U/DC#1, the long counter, indicates a definite out-of-sync condition, and its
pulse is used to inhibit gates G2, G3, and G9. The length of this inhibit
gate is a function of the number of clock pulses that exist between the U/DC#2
output and the U/DC#1 output. A functional relationship that may be used to
establish the length of the inhibit gate is to make it equal to the word length
minus one, divided by two, that is,

(word length - 1)/2 = inhibit gate length.
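The counter behavior described above can be sketched in software as follows. This is a simplified Python model, not the gate-level circuit: it only accumulates the net count of each up/down counter for one assumed word alignment. The function names, the serial MSB-first bit order, and the test signal (a slowly varying sine wave quantized to 7 bits) are assumptions made for this sketch.

```python
import math


def to_bits(words, word_length):
    """Serialize words MSB first into a flat list of 0/1 bits."""
    return [(w >> (word_length - 1 - pos)) & 1
            for w in words for pos in range(word_length)]


def counter_drift(bits, offset, word_length):
    """Net counts of the two up/down counters for one assumed alignment.

    Counter 1: an assumed-MSB change counts up, an assumed-LSB change counts down.
    Counter 2: an assumed-NMSB change counts up, an assumed-NLSB change counts down.
    """
    n_words = (len(bits) - offset) // word_length
    words = [bits[offset + j * word_length: offset + (j + 1) * word_length]
             for j in range(n_words)]
    counter1 = counter2 = 0
    for prev, curr in zip(words, words[1:]):
        changed = [a ^ b for a, b in zip(prev, curr)]  # EXCLUSIVE-OR per bit
        counter1 += changed[0] - changed[-1]
        counter2 += changed[1] - changed[-2]
    return counter1, counter2


# Hypothetical test input: slowly varying sine wave, 7-bit uniform quantizer.
WORD_LENGTH = 7
samples = [round((math.sin(2 * math.pi * i / 400) + 1) / 2 * 127)
           for i in range(4000)]
bits = to_bits(samples, WORD_LENGTH)

aligned = counter_drift(bits, 0, WORD_LENGTH)      # true word boundaries
misaligned = counter_drift(bits, 5, WORD_LENGTH)   # boundary shifted by 5 bits
inhibit_gate_length = (WORD_LENGTH - 1) // 2       # (word length - 1)/2
```

With the true alignment both counters drift down (the in-sync pattern), while a boundary shifted by five bits makes both drift up, since the assumed MSB and NMSB are then really low-order bits.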


A  detailed  discussion  follows. 


The twenty-one successive bit sequence in (7) is repeated below
with the assumed words indicated as before (word 1 between points A and
B, and word 2 between points B and C).

[Equation (8): the bit sequence $x_1^1 x_2^1 \cdots x_7^3$ of (7), marked with the alternative assumed word boundaries A, B, C; A', B', C'; A'', B'', C''; and A''', B''', C'''.]

For word A'B', $x_6^1$ is MSB and $x_7^1$ is NMSB, but in reality $x_6^1$ is NLSB and
$x_7^1$ is LSB. Also $x_4^2$ is NLSB and $x_5^2$ is LSB, but in reality $x_4^2$ is next next
NLSB (NNNLSB) and $x_5^2$ is next NLSB (NNLSB). Therefore, MSB (NLSB) changes
more frequently than LSB (NNLSB), and NMSB (LSB) changes more frequently than
NLSB (NNNLSB). These conditions mean U/DC#1 counts up and U/DC#2 also counts
up. The strategies used in realigning the word based on this assumed word
alignment and others are given in Table V below.

Table V Decision Strategy for Word Alignment

WORDS: AB, BC - initial LOAD

(1) WORDS: AB, BC - Alignment inhibit gate three (3) bits long (U/DC#1
down and U/DC#2 down)

(2) WORDS: A'B', B'C' - Alignment inhibit gate two (2) bits long (U/DC#1
up and U/DC#2 up)

(3) WORDS: A''B'', B''C'' - Alignment inhibit gate one (1) bit long (U/DC#1
(fast) up and U/DC#2 (slow) down)

(4) WORDS: A'''B''', B'''C''' - Alignment inhibit gate zero (0) bits long
(U/DC#1 NORMAL and U/DC#2 down) (In-Sync condition)

(5) WORDS: A^IV B^IV, B^IV C^IV - Alignment inhibit gate six (6) bits long
(U/DC#1 (Slow) up and U/DC#2 (Fast) down)


The contents of Table V are self-explanatory, but a comment about
strategies (3), (4), and (5) is in order.

In (3), U/DC#1 (FAST) up means this counter counts up at or near the
clock rate, and U/DC#2 (SLOW) down indicates the counter counts down at or
near 1/8 the clock rate.

In (4), U/DC#1 normal means at or near 1/8 the clock rate, either up or
down, and U/DC#2 counts down at a constant rate near the clock rate; this
leads to the output of an in-sync pulse.

In (5), U/DC#1 (SLOW) up means this counter counts up at or near 1/8
the clock rate, while U/DC#2 (FAST) down counts at or near the clock rate.

Note that once words A'''B''' and B'''C''' are attained, the system has
recognized the input sequence as being digitized speech, or a signal of
similar analog structure. Also note that the system has been synchronized
to the input sequence and the resulting word strings are now ready to be fed
into the LP subsystem discussed above. The combined capabilities of both
systems provide a two (2) level speech recognition scheme consisting of a
real-time partial recognition decision after word synchronization is attained,
and a fine-grain total recognition decision in the linear prediction
subsystem.


CHAPTER  III  -  Conclusion 

We have presented the theory and design of a technique that will allow
the sorting of digitized speech (PCM and DPCM) from other common digital
data signals in communication networks, i.e., RF radio and telephone cable
systems. The proposed technique consists of two subsystems: 1) a word
synchronization sorting scheme that identifies word boundaries in an input
digital sequence, and 2) a signal processor that transforms the words of
scheme (1) into a quasi-analog signal which is then passed through four (4)
subsystems whose amplitude weights are known and can be changed at will.

The four (4) subsystem outputs, in conjunction with the original
quasi-analog signal, are used in a decision/thresholding algorithm to decide
whether the input sequence is digitized speech or not.

The decision algorithm was tested via several examples, and the results
indicate good discrimination of the digitized speech signal from the common
communication network signals. Because of the nonavailability of a
high-speed analog-to-digital converter, i.e., PCM/DPCM subsystems, we were not
able to conduct the planned digital computer simulation. It is felt, however,
that if the hardware is available, i.e., a microprocessor for control and
PCM/DPCM subsystems, an actual field test or lab test would be better.
APPENDIX


Consider a six-bit digital word sequence.

Let the probability of bit i of word j be $p(x_i^j)$ and the probability
of bit i of word k be $p(x_i^k)$.

If we let $p(x_i^j) > p(x_i^k)$, then Shannon's information measure says that
bit i of word j provides less information than the same bit in word k.

If we consider, within the same word j, bits n and m, and we say that

$$p(x_n^j) > p(x_m^j),$$

then bit m provides more information about the behavior of word j than bit n.

For example, suppose the probability of a change in the LSB is .4, and the
probability of a change in the MSB of the same word is .05 [3]. Then
Shannon's information, defined as

$$I = \log_2 \frac{1}{p(x_i^j)} = -\log_2 p(x_i^j) \quad \text{(number of bits of information)},$$

gives

LSB: $I_{LSB} = -\log_2(.4) = 1.32$ bits,

and

MSB: $I_{MSB} = -\log_2(.05) = 4.32$ bits.
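The two information values above follow directly from the definition and can be checked with a short Python sketch; the function name `information_bits` is hypothetical.

```python
import math


def information_bits(probability):
    """Shannon information, in bits, of an event with the given probability."""
    return -math.log2(probability)


lsb_info = information_bits(0.4)    # probability of an LSB change
msb_info = information_bits(0.05)   # probability of an MSB change
print(round(lsb_info, 2), round(msb_info, 2))  # prints 1.32 4.32
```

The rarer MSB change carries exactly three bits more information than the LSB change here, since .4/.05 = 8.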

If we consider even longer words, where there are more bits between the
LSB and the MSB, then $I_{MSB}$ will be much larger than $I_{LSB}$. There is, however,
a definite limit on the length of words that can be used, since the words
must transfer some information. The trade-off of word length and message
transmission rate is beyond the scope of the Appendix. One can consider
the trade-off of accuracy and timing of the decision versus the time delay in
noting the change in the MSB. Perhaps a change in some bit between the MSB and
the LSB is sufficient.

